Coping with silent errors in HPC applications

Authors

  • Guillaume Aupy
  • Anne Benoit
  • Massimiliano Fasi
  • Yves Robert
  • Hongyang Sun
  • Bora Uçar
Abstract

This report describes a unified framework for the detection and correction of silent errors, which constitute a major threat to scientific applications at extreme scale. We first motivate the problem and explain why checkpointing must be combined with some verification mechanism. We then introduce a general-purpose technique based on computational patterns that repeat periodically over time. These patterns interleave verifications and checkpoints, and we show how to determine the pattern that minimizes expected execution time. Finally, we move to application-specific techniques and review dynamic programming algorithms for linear chains of tasks, as well as ABFT-oriented algorithms for iterative methods in sparse linear algebra.

Key-words: fault tolerance, checkpointing, silent errors, error detection, error correction, algorithm-based fault tolerance, linear task graphs, conjugate-gradient algorithm.

∗ Penn State University, USA
† CNRS, École Normale Supérieure de Lyon & INRIA, France, {anne.benoit|aurelien.cavelan|yves.robert|hongyang.sun|bora.ucar}@ens-lyon.fr
‡ University of Manchester, UK
§ University of Tennessee Knoxville

French title: Détection et correction des erreurs silencieuses dans les applications de calcul scientifique à haute performance

Summary (translated from French): This report describes a unified model for the detection and correction of silent errors in high-performance scientific computing applications. We first propose a general method based on periodic computational patterns that combine checkpoints and verifications. We then address two special cases, namely chains of tasks and sparse linear solvers. Keywords: resilience, silent errors, checkpoint, ABFT, sparse matrix-vector product.

Figure 1: Error and detection latency.
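The periodic pattern mentioned in the abstract (a chunk of work, then a verification, then a checkpoint) can be illustrated with a small numerical sketch. Nothing below is taken from the report itself: it assumes a simplified model in which silent errors arrive at a constant rate `lam` and strike only the computation, the verification is guaranteed to catch any error, and checkpoint and recovery are error-free. Under those assumptions the expected time of one pattern satisfies E = w + v + p(r + E) + (1 - p)c with p = 1 - exp(-lam * w). The function names and all parameter values are hypothetical.

```python
import math


def expected_pattern_time(w, v, c, r, lam):
    """Expected time to complete one pattern under the assumed model.

    w: work, v: verification cost, c: checkpoint cost, r: recovery cost,
    lam: silent error rate (errors strike computation only).
    Solving E = w + v + p*(r + E) + (1 - p)*c with p = 1 - exp(-lam*w)
    gives E = exp(lam*w)*(w + v) + (exp(lam*w) - 1)*r + c.
    """
    growth = math.exp(lam * w)
    return growth * (w + v) + (growth - 1.0) * r + c


def expected_overhead(w, v, c, r, lam):
    """Expected extra time per unit of useful work."""
    return expected_pattern_time(w, v, c, r, lam) / w - 1.0


def first_order_period(v, c, lam):
    """First-order approximation of the work length minimizing overhead."""
    return math.sqrt((v + c) / lam)


# Hypothetical values, in seconds (rate in errors per second).
LAM, V, C, R = 1e-5, 20.0, 60.0, 60.0

w_star = first_order_period(V, C, LAM)
for w in (w_star / 2, w_star, 2 * w_star):
    print(f"w = {w:8.1f} s  overhead = {expected_overhead(w, V, C, R, LAM):.4f}")
```

With these made-up values, the overhead is lowest at the first-order period and grows when the pattern is made either half or twice as long, which is the trade-off between verifying and checkpointing too often and losing too much work per error.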

Similar resources

Coping with recall and precision of soft error detectors

Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each method comes with a cost, a recall (fraction of all errors that are actually detected; missed errors are false negatives), and a precision (fraction of true errors among all detected errors; spurious detections are false positives). The main contribution of this paper is to characterize the optimal computing pattern f...

Coping with silent and fail-stop errors at scale by combining replication and checkpointing

This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors, as well as to cope with both silent and fail-stop errors on large-scale platforms. Fail-stop errors are immediately detected, unlike silent errors for which a detection mechanism is required. To detect silent errors, many application-specific techniques are available, either ba...

Exploring Partial Replication to Improve Lightweight Silent Data Corruption Detection for HPC Applications

Silent data corruption (SDC) poses a great challenge for high-performance computing (HPC) applications as we move to extreme-scale systems. If not dealt with properly, SDC has the potential to influence important scientific results, leading scientists to wrong conclusions. In previous work, our detector was able to detect SDC in HPC applications to a certain level by using the peculiarities of t...

Two-level checkpointing and partial verifications for linear task graphs

Fail-stop and silent errors are unavoidable on large-scale platforms. Efficient resilience techniques must accommodate both error sources. A traditional checkpointing and rollback recovery approach can be used, with added verifications to detect silent errors. A fail-stop error leads to the loss of the whole memory content, hence the obligation to checkpoint on a stable storage (e.g., an extern...

Multi-level checkpointing and silent error detection for linear workflows

We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, and we extend single-level checkpointing in two important directions. Our first contribution targets silent errors, and combines in-memory checkpoints with both partial and guaranteed verifications. Our second contribution deals with multi-level checkpointing for fail-stop errors. We present sophi...

Publication date: 2016